Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1135] improvement(docs): Add docs about tables advanced feature like partitioning #1203

Merged
merged 22 commits into from
Jan 2, 2024

Conversation

yuqi1129
Copy link
Contributor

What changes were proposed in this pull request?

Add docs about the details of table partitioning, bucketing, and sorting order.

Why are the changes needed?

The document is mandatory for users.

Fix: #1135

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

N/A

@yuqi1129 yuqi1129 self-assigned this Dec 20, 2023
| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` |
| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` |

Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add this to the document.

All transforms must return null for a null input value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure about this point as it's only appliable to Iceberg currently, I need to check this for Hive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if we don't follow this, should we explain null input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure about this point as it's only appliable to Iceberg currently, I need to check this for Hive.

Can you help confirm this, @mchades?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is not determined by Gravitino, but rather depends on the underlying catalog.

Copy link
Contributor

@mchades mchades left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Expression section should be an independent chapter, making it easier for Partitioning, Bucket, and sortOrder to reference it.

docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table.
:::

| Function strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, Gravitino do not care about the source type and result type, the type limitation depends on the underlying catalog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqqttt123 What's your opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a transform, source type and result type are important. Gravitino may not care. But users will care about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding links to different catalog partitioning docs?

For the same type and partitioning strategy, it may be feasible in catalogA but prohibited in catalogB, as this is likely dependent on the catalog's implicit type conversion strategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may require a few more PRs to refine it, so I will add an issue about it later.

docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
@yuqi1129
Copy link
Contributor Author

The Expression section should be an independent chapter, making it easier for Partitioning, Bucket, and sortOrder to reference it.

The first version I did used a separate chapter to describe it. It seems a bit isolated, so I merged it into this file.

docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
@yuqi1129
Copy link
Contributor Author

@jerryshao
Can you spare some time to review the PR?

Copy link
Member

@justinmclean justinmclean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some minor improvements. Also there are several double blank lines.

docs/manage-metadata-using-gravitino.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
@yuqi1129
Copy link
Contributor Author

@jerryshao Can you spare some time to review the PR?

@jerryshao
Please take time to review it, thanks.

docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
Comment on lines +69 to +82
| Bucket strategy | Description | JSON | Java |
|-----------------|-------------------------------------------------------------------------------------------------------------------------------|----------|------------------|
| hash | Bucket table using hash. Gravitino will distribute table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` |
| range | Bucket table using range. Gravitino will distribute table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` |
| even | Bucket table using even. Gravitino will distribute table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` |

- Number. It defines how many buckets you use to bucket the table.
- Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments.

| Expression type | JSON example | Java example | Equivalent SQL semantics | Description |
|-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------|
| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | The field reference value `score` |
| function | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)` | The function value `hour(dt)` |
| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder().withValue("10").withDataType(Types.IntegerType.get()).build()` | `10` | The integer literal `10` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content here has no any introduction about bucketing, directly introducing Strategy, Number... is very hard for user to understand what is it. You should have a basic introduction about the bucketing, and the required fields of bucketing.

Also not only for bucketing, but also for partitioning and sort ordering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is in the manage-metadata-using-gravitino.md and I have updated it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean you'd better have a paragraph introducing all the required fields to combine partitioning, bucketing and sort ordering. For a user who doesn't have any background of these things, directly introducing Strategy, Number and others seems not so easy to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

docs/table-partitioning-bucketing-sort-order.md Outdated Show resolved Hide resolved
@yuqi1129
Copy link
Contributor Author

yuqi1129 commented Jan 2, 2024

@jerryshao
Please take a look if the modification is reasonable, thanks.

<Tabs>
<TabItem value="shell" label="Shell">

```shell
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the doc I see using bash, I think you'd better unifying them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we use bash thorough the doc manmage-metadata-using-gravitino.md, maybe I should change them all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use shell in other places, so better to change to shell.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bash or shell may affect the code highlighting, although I am not entirely certain. I suggest that you take a look at the actual effect on the website.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested locally and it seems fine to use shell.

2. Change the language of some code blocks from `bash` to `shell`
@jerryshao jerryshao added need backport Issues that need to backport to another branch branch-0.3 labels Jan 2, 2024
Comment on lines 282 to 290
new DistributionDTO.Builder()
.withStrategy(Strategy.HASH)
.withNumber(4)
.withArgs(FieldReferenceDTO.of("id"))
.build(),
// SORTED BY name asc
new SortOrderDTO[] {
SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST)
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not use xxxDTO, instead we should use some methods in APIs, right? @mchades

Copy link
Contributor

@mchades mchades Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are several static methods in APIs, xxxDTO is simply a representation of the intermediate result of JSON deserialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method isn't yet available in the APIs,
image

So, do I need to add it to this PR or use another PR later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use com.datastrato.gravitino.rel.expressions.NamedReference#field(java.lang.String)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two follow-things:

  1. "Column" should have an implementation used by client, not DTO, @mchades can you please work on this?
  2. Make sure all the IT codes use APIs, not DTOs for partitioning/sorting/distribution. @yuqi1129 can you please work on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix it here.

done

Please fix it here.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. "Column" should have an implementation used by client, not DTO

Tracked by #1292

@jerryshao
Copy link
Contributor

@mchades would you please help to review again?

@jerryshao jerryshao merged commit 04adf54 into apache:main Jan 2, 2024
4 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 2, 2024
…e partitioning (#1203)

### What changes were proposed in this pull request?

Add docs about the details of table partitioning, bucketing, and sorting
order.

### Why are the changes needed?

The document is mandatory for users. 

Fix: #1135 

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

N/A

---------

Co-authored-by: Jerry Shao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need backport Issues that need to backport to another branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Add docs about partitioned, sorted order and bucketed tables
5 participants